{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NLP and NLTK Basics\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SparkContext and SparkSession"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark import SparkContext\n",
    "sc = SparkContext(master = 'local')\n",
    "\n",
    "from pyspark.sql import SparkSession\n",
    "spark = SparkSession.builder \\\n",
    "          .appName(\"Python Spark SQL basic example\") \\\n",
    "          .config(\"spark.some.config.option\", \"some-value\") \\\n",
    "          .getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A lot of examples in this article are borrowed from the book written by **Bird et al. (2009)**. Here I tried to implement the examples from the book with spark as much as possible.\n",
    "\n",
    "Refer to the book for more details: Bird, Steven, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. \" O'Reilly Media, Inc.\", 2009.\n",
    "\n",
    "## Basic terminology\n",
    "\n",
    "* **text**: a sequence of words and punctuation.\n",
    "* **frequency distribution**: the frequency of words in a text object.\n",
    "* **collocation**: a **sequence of words** that occur together unusually often.\n",
    "* **bigrams**: word pairs. High frequent bigrams are collocations.\n",
    "* **corpus**: a large body of text\n",
    "* **wordnet**: a lexical database in which english words are grouped into sets of synonyms (**also called synsets**).\n",
    "* **text normalization**: the process of transforming text into a single canonical form, e.g., converting text to lowercase, removing punctuations and so on.\n",
    "* **Lemmatization**: the process of grouping variant forms of the same word so that they can be analyzed as a single item.\n",
    "* **Stemming**: the process of reducing inflected words to their **word stem**.\n",
    "* **tokenization**:\n",
    "* **segmentation**:\n",
    "* **chunking**:\n",
    "\n",
    "## Texts as lists of words\n",
    "\n",
    "Create a data frame consisting of text elements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----------------------------------------+\n",
      "|texts                                   |\n",
      "+----------------------------------------+\n",
      "|[I, like, playing, basketball]          |\n",
      "|[I, like, coding]                       |\n",
      "|[I, like, machine, learning, very, much]|\n",
      "+----------------------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "pdf = pd.DataFrame({\n",
    "        'texts': [['I', 'like', 'playing', 'basketball'],\n",
    "                 ['I', 'like', 'coding'],\n",
    "                 ['I', 'like', 'machine', 'learning', 'very', 'much']]\n",
    "    })\n",
    "    \n",
    "df = spark.createDataFrame(pdf)\n",
    "df.show(truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ngrams and collocations\n",
    "\n",
    "Transform texts to 2-grams, 3-grams and 4-grams collocations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import NGram\n",
    "from pyspark.ml import Pipeline\n",
    "ngrams = [NGram(n=n, inputCol='texts', outputCol=str(n)+'-grams') for n in [2,3,4]]\n",
    "\n",
    "# build pipeline model\n",
    "pipeline = Pipeline(stages=ngrams)\n",
    "\n",
    "# transform data\n",
    "texts_ngrams = pipeline.fit(df).transform(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------------------------------------------------------------+\n",
      "|2-grams                                                           |\n",
      "+------------------------------------------------------------------+\n",
      "|[I like, like playing, playing basketball]                        |\n",
      "|[I like, like coding]                                             |\n",
      "|[I like, like machine, machine learning, learning very, very much]|\n",
      "+------------------------------------------------------------------+\n",
      "\n",
      "+----------------------------------------------------------------------------------+\n",
      "|3-grams                                                                           |\n",
      "+----------------------------------------------------------------------------------+\n",
      "|[I like playing, like playing basketball]                                         |\n",
      "|[I like coding]                                                                   |\n",
      "|[I like machine, like machine learning, machine learning very, learning very much]|\n",
      "+----------------------------------------------------------------------------------+\n",
      "\n",
      "+---------------------------------------------------------------------------------+\n",
      "|4-grams                                                                          |\n",
      "+---------------------------------------------------------------------------------+\n",
      "|[I like playing basketball]                                                      |\n",
      "|[]                                                                               |\n",
      "|[I like machine learning, like machine learning very, machine learning very much]|\n",
      "+---------------------------------------------------------------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# display result\n",
    "texts_ngrams.select('2-grams').show(truncate=False)\n",
    "texts_ngrams.select('3-grams').show(truncate=False)\n",
    "texts_ngrams.select('4-grams').show(truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Access corpora from the NLTK package\n",
    "\n",
    "### The `gutenberg` corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Get file ids in gutenberg corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['austen-emma.txt',\n",
       " 'austen-persuasion.txt',\n",
       " 'austen-sense.txt',\n",
       " 'bible-kjv.txt',\n",
       " 'blake-poems.txt',\n",
       " 'bryant-stories.txt',\n",
       " 'burgess-busterbrown.txt',\n",
       " 'carroll-alice.txt',\n",
       " 'chesterton-ball.txt',\n",
       " 'chesterton-brown.txt',\n",
       " 'chesterton-thursday.txt',\n",
       " 'edgeworth-parents.txt',\n",
       " 'melville-moby_dick.txt',\n",
       " 'milton-paradise.txt',\n",
       " 'shakespeare-caesar.txt',\n",
       " 'shakespeare-hamlet.txt',\n",
       " 'shakespeare-macbeth.txt',\n",
       " 'whitman-leaves.txt']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from nltk.corpus import gutenberg\n",
    "\n",
    "gutenberg_fileids = gutenberg.fileids()\n",
    "gutenberg_fileids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Absolute path of a file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "FileSystemPathPointer('/Users/mingchen/nltk_data/corpora/gutenberg/austen-emma.txt')"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gutenberg.abspath(gutenberg_fileids[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Raw text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'[Emma by Jane Austen 1816]\\n\\nVOLUME I\\n\\nCHAPTER I\\n\\n\\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\\nand happy disposition, seemed to unite some of the best blessings\\nof existence; an'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gutenberg.raw(gutenberg_fileids[0])[:200]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The words of the entire corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gutenberg.words()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2621613"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(gutenberg.words())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Sentences of a specific file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gutenberg.sents(gutenberg_fileids[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7752"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(gutenberg.sents(gutenberg_fileids[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading custom corpus\n",
    "\n",
    "Let's create a corpus consisting all files from the **./data** directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from nltk.corpus import PlaintextCorpusReader\n",
    "corpus_data = PlaintextCorpusReader('./data', '.*')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Files in the corpus *corpus_data*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Advertising.csv',\n",
       " 'Credit.csv',\n",
       " 'WineData.csv',\n",
       " 'churn-bigml-20.csv',\n",
       " 'churn-bigml-80.csv',\n",
       " 'cuse_binary.csv',\n",
       " 'horseshoe_crab.csv',\n",
       " 'hsb2.csv',\n",
       " 'hsb2_modified.csv',\n",
       " 'iris.csv',\n",
       " 'mtcars.csv',\n",
       " 'prostate.csv',\n",
       " 'twitter.txt']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_fileids = corpus_data.fileids()\n",
    "data_fileids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Raw text in *twitter.txt*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Fresh install of XP on new computer. Sweet relief! fuck vista\\t1018769417\\t1.0\\nWell. Now I know where to go when I want my knives. #ChiChevySXSW http://post.ly/RvDl\\t10284216536\\t1.0\\n\"Literally six weeks before I can take off \"\"SSC Chair\"\" off my email. Its like the torturous 4th mile before everything stops hurting.\"\\t10298589026\\t1.0\\nMitsubishi i MiEV - Wikipedia, the free encyclopedia - http://goo.gl/xipe Cutest car ever!\\t109017669432377344\\t1.0\\n\\'Cheap Eats in SLP\\' - http://t.co/4w8gRp7\\t109642968603963392\\t1.0\\nTeenage Mutant Ninja Turtle art is never a bad thing... http://bit.ly/aDMHyW\\t10995492579\\t1.0\\nNew demographic survey of online video viewers: http://bit.ly/cx8b7I via @KellyOlexa\\t11713360136\\t1.0\\nhi all - i\\'m going to be tweeting things lookstat at the @lookstat twitter account. please follow me there\\t1208319583\\t1.0\\nHoly carp, no. That movie will seriously suffer for it. RT @MouseInfo: Anyone excited for The Little Mermaid in 3D?\\t121330835726155776\\t1.0\\n\"Did I really need to learn \"\"I bought a box and put in it things\"\" in arabic? This is the most random book ever.\"\\t12358025545\\t1.0\\n'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus_data.raw('twitter.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Words and sentences in file *twitter.txt*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Fresh', 'install', 'of', 'XP', 'on', 'new', ...]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus_data.words(fileids='twitter.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "253"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(corpus_data.words(fileids='twitter.txt'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Fresh', 'install', 'of', 'XP', 'on', 'new', 'computer', '.'], ['Sweet', 'relief', '!'], ...]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus_data.sents(fileids='twitter.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(corpus_data.sents(fileids='twitter.txt'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## WordNet\n",
    "\n",
    "The `nltk.corpus.wordnet.synsets()` function load all synsents with a given lemma and part of speech tag.\n",
    "\n",
    "Load all synsets into a spark data frame given the lemma `car`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------+\n",
      "|   car_synsets|\n",
      "+--------------+\n",
      "|      car.n.01|\n",
      "|      car.n.02|\n",
      "|      car.n.03|\n",
      "|      car.n.04|\n",
      "|cable_car.n.01|\n",
      "+--------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from nltk.corpus import wordnet\n",
    "wordnet.synsets\n",
    "pdf = pd.DataFrame({\n",
    "        'car_synsets': [synsets._name for synsets in wordnet.synsets('car')]\n",
    "    })\n",
    "df = spark.createDataFrame(pdf)\n",
    "df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get lemma names given a synset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['car', 'railcar', 'railway_car', 'railroad_car']"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from pyspark.sql.functions import udf\n",
    "from pyspark.sql.types import *\n",
    "from nltk.corpus import wordnet\n",
    "\n",
    "def lemma_names_from_synset(x):\n",
    "    synset = wordnet.synset(x)\n",
    "    return synset.lemma_names()\n",
    "\n",
    "lemma_names_from_synset('car.n.02')\n",
    "# synset_lemmas_udf = udf(lemma_names_from_synset, ArrayType(StringType()))\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
